DiSC: A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce

Authors

Chen Jin

Md. Mostofa Ali Patwary

Ankit Agrawal

William Hendrix

Wei-keng Liao

Alok Choudhary

Abstract

Hierarchical clustering is widely used in many applications because it represents clustering results informatively. However, its high computational cost and inherent data dependency prevent it from running efficiently on large datasets. In this paper, we present a distributed single-linkage hierarchical clustering algorithm (DiSC) based on MapReduce, one of the most popular programming models for scalable data analysis. The main idea is to divide the original problem into a set of overlapping subproblems, solve each subproblem, and then merge the sub-solutions into an overall solution. Further, our algorithm is flexible enough to be used in practice, since configurable parameters for the data merge phase let it run in a fairly small number of MapReduce rounds. In our experiments, we evaluate DiSC on synthetic datasets of varied size and dimensionality and find that it achieves a scalable speedup of up to 160 on 190 computer cores.
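The abstract does not say how the subproblems are formed or how their solutions are merged. The following is a minimal single-process sketch of the divide, solve, and merge idea, under these assumptions: the data is cut into `num_splits` splits, each subproblem is the union of two splits, each subproblem is solved by building a minimum spanning tree (MST) with Prim's algorithm, and the merge step runs Kruskal's algorithm over all local MST edges. It relies on the standard equivalence between single-linkage clustering and the MST of the complete distance graph. The names `prim_mst`, `kruskal_merge`, `disc_sketch`, and `num_splits` are hypothetical, and no MapReduce framework is involved.

```python
import itertools
import math


def prim_mst(points, ids):
    """Prim's algorithm on the complete Euclidean graph over the given point ids."""
    ids = list(ids)
    if len(ids) < 2:
        return []
    root = ids[0]
    # best[v] = (distance to the tree so far, tree vertex that achieves it)
    best = {v: (math.dist(points[root], points[v]), root) for v in ids[1:]}
    edges = []
    while best:
        v = min(best, key=lambda k: best[k][0])
        w, u = best.pop(v)
        edges.append((w, min(u, v), max(u, v)))
        for x in best:
            d = math.dist(points[v], points[x])
            if d < best[x][0]:
                best[x] = (d, v)
    return edges


def kruskal_merge(edges, n):
    """Kruskal's algorithm over candidate edges, using a union-find forest."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(set(edges)):  # set() drops edges found by several subproblems
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst


def disc_sketch(points, num_splits):
    """Assumed scheme: one subproblem per pair of splits, then a Kruskal merge."""
    if num_splits < 2:
        raise ValueError("num_splits must be at least 2 so every point pair is covered")
    n = len(points)
    splits = [list(range(i, n, num_splits)) for i in range(num_splits)]
    local_edges = []
    # "Map" side: each pair of splits is an independent, overlapping subproblem.
    for a, b in itertools.combinations(splits, 2):
        local_edges.extend(prim_mst(points, a + b))
    # "Reduce" side: merge the sub-solutions into one global MST.
    return kruskal_merge(local_edges, n)


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
mst = disc_sketch(points, num_splits=3)
for weight, u, v in mst:
    print(f"merge clusters containing {u} and {v} at distance {weight}")
```

Because every pair of points lies together in at least one subproblem, any edge discarded by a local MST is also absent from the global MST, so the merged result matches the MST of the full dataset. Reading the returned edges in increasing weight order gives the single-linkage merge sequence.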

Similar resources

Incremental, distributed single-linkage hierarchical clustering algorithm using mapreduce

Single-linkage hierarchical clustering is one of the prominent and widely-used data mining techniques for its informative representation of clustering results. However, the parallelization of this algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. Moreover, in many modern applications, new data is continuously added into the already huge ...

Choosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation

The assessment of watershed sediment load is necessary for controlling soil erosion and reducing the potential for sediment production. Differing estimates of sediment amounts, along with the lack of long-term measurements, limit access to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...

Hierarchical clustering of large text datasets using Locality-Sensitive Hashing

In this paper, we present a hierarchical clustering algorithm for large text datasets using Locality-Sensitive Hashing (LSH). The main idea of LSH is to "hash" items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. The main drawback of conventional hierarchical algorithms is their large time complexity (e.g. Single Linka...

PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

Large datasets, of the order of peta- and terabytes, are becoming prevalent in many scientific domains including astronomy, physical sciences, bioinformatics and medicine. To effectively store, query and analyze these gigantic repositories, parallel and distributed architectures have become popular. Apache Hadoop is a distributed file system that provides support for data-intensive applications. I...

Finding Connected Components on Map-reduce in Logarithmic Rounds

Given a large graph G = (V,E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph, and by n the number of nodes in the largest component, all prior techniques for...

Publication date: 2013
